In this chapter, we explore the Boston Housing dataset, which comprises various attributes of housing areas around Boston, Massachusetts, as recorded in the 1970s. It is a rich dataset often used for understanding the housing market through statistical learning techniques. As per the assignment, I begin by loading the data and examining its structure, highlighting key variables like crime rates, property tax rates, and median home values. Next, I provide a visual and statistical summary, discussing each variable's distribution and interrelationships. Then, I standardize the data to prepare for more complex analyses, including clustering and linear discriminant analysis (LDA), which reveal insights into the socio-economic patterns affecting housing values.
Through this assignment, I have learned new statistical learning techniques. I gained insights into housing market patterns by performing exploratory data analysis, standardization, clustering, and discriminant analysis, and further enhanced my data visualization skills.
date()
## [1] "Mon Nov 20 17:50:47 2023"
# Loading and Exploring the Boston Dataset
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
## Graphical Overview of the Data
Next, I'm going to use ‘pairs’ to create pairwise scatter plots for an overview of relationships and ‘summary’ to give a statistical summary of each variable.
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
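To complement the summaries, a correlation matrix quantifies the pairwise relationships that the scatter plots only hint at. A minimal sketch in base R (no extra packages assumed beyond MASS):

```r
library(MASS)  # provides the Boston data frame
data("Boston")

# Round the correlation matrix for readability; sign and magnitude
# indicate direction and strength of each pairwise relationship
cor_matrix <- round(cor(Boston), 2)
cor_matrix["lstat", "medv"]  # strongly negative, as the plots suggest
```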
### Distributions of the Variables
- crim: The distribution is highly right-skewed; most suburbs exhibit low crime rates, but a few have extremely high rates.
- zn: Also right-skewed, indicating that many suburbs have no land zoned for large residential plots.
- indus: Displays a more varied distribution with a noticeable peak around 18, suggesting a common proportion of industrial businesses across suburbs.
- chas: As a binary variable, the data shows that most suburbs do not border the river.
- nox: Exhibits slight right-skewness; while most areas have moderate nitric oxide levels, some have significantly high levels.
- rm: Appears approximately normally distributed, with most suburbs near the median of 6.2 rooms.
- age: The distribution is left-skewed, with many houses being older, peaking at 100 years.
- dis: The right-skewed shape indicates that most suburbs are close to employment centers, with a few much further away.
- rad: The bimodal distribution suggests that suburbs are typically either very close or very far from highways.
- tax: Also bimodal and likely correlated with rad, indicating variations in tax rates depending on highway accessibility.
- ptratio: Shows slight left-skewness, with most suburbs having a relatively high pupil-teacher ratio.
- black: Left-skewed, with most areas having a high proportion, although some have significantly low proportions.
- lstat: Right-skewed; most suburbs have a lower proportion of lower-status population.
- medv: Right-skewed with a ceiling effect at 50, indicating capped median values in the dataset.

Notable relationships visible in the pairwise scatter plots:

- rm and medv: A positive correlation suggests that suburbs with more rooms tend to have higher median home values.
- lstat and medv: A visible negative correlation implies that suburbs with a higher percentage of lower-status residents have lower home values.
- nox and indus: A positive correlation indicates that more industrial areas have higher nitric oxide concentrations.
- dis and nox: A negative correlation suggests that areas further from employment centers have lower nitric oxide concentrations.
- age and nox: There seems to be a trend where older houses are in areas with higher nitric oxide concentrations.
- rad and tax: A high correlation indicates that suburbs with better highway access tend to have higher tax rates.

# Installing and loading the caret package
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Printing out summaries of the scaled data
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Creating a categorical variable of the crime rate using quantiles
Boston$crime_cat <- cut(Boston$crim, breaks=quantile(Boston$crim, probs=seq(0, 1, by=0.25)), include.lowest=TRUE)
# Dropping the old crime rate variable from the dataset
Boston <- Boston[,-which(names(Boston) == "crim")]
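As a toy illustration (using a made-up vector standing in for Boston$crim) of how cut() with quantile breakpoints splits a variable into four roughly equal groups:

```r
# Hypothetical numeric vector; any skewed positive data works the same way
x <- c(0.1, 0.5, 1.2, 3.4, 7.8, 0.05, 2.2, 9.9)

# Quartile breakpoints; include.lowest = TRUE keeps the minimum in the first bin
cats <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)
table(cats)  # two observations fall into each quartile bin
```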
# Dividing the dataset into train and test sets (80% train, 20% test)
trainIndex <- createDataPartition(Boston$crime_cat, p = .8, list = FALSE, times = 1)
train_set <- Boston[trainIndex, ]
test_set <- Boston[-trainIndex, ]
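createDataPartition() stratifies the split by the class variable so each crime category keeps its proportions. If caret were unavailable, a plain (non-stratified) base-R split would look like this sketch, with a toy data frame df standing in for Boston:

```r
set.seed(123)
df <- data.frame(y = rep(letters[1:4], each = 25), x = rnorm(100))  # toy data

# Sample 80% of the row indices for training; the remainder forms the test set
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train_df <- df[train_idx, ]
test_df  <- df[-train_idx, ]
```

The trade-off: this is simpler, but a rare class could end up under-represented in the training set, which stratified sampling avoids.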
# Loading the MASS package for LDA
library(MASS)
# Fitting the LDA model using the categorical crime rate as the target variable
lda_fit <- lda(crime_cat ~ ., data = train_set)
# Summarizing the LDA fit
lda_fit
## Call:
## lda(crime_cat ~ ., data = train_set)
##
## Prior probabilities of groups:
## [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## 0.2512315 0.2487685 0.2487685 0.2512315
##
## Group means:
## zn indus chas nox rm age
## [0.00632,0.082] 31.044118 4.910588 0.03921569 0.4563225 6.603618 44.69118
## (0.082,0.257] 9.504950 8.956832 0.04950495 0.4898604 6.178069 59.35446
## (0.257,3.68] 1.980198 12.547822 0.10891089 0.6002574 6.343000 82.35050
## (3.68,89] 0.000000 18.114510 0.06862745 0.6738431 5.977176 91.35392
## dis rad tax ptratio black lstat
## [0.00632,0.082] 5.567500 3.480392 282.7941 17.52941 390.6793 7.254804
## (0.082,0.257] 4.503741 4.732673 331.4554 18.28317 385.5904 11.834950
## (0.257,3.68] 2.906778 5.752475 356.7624 17.86535 366.9355 12.780990
## (3.68,89] 2.000806 23.813725 663.4216 20.14608 282.7808 19.318235
## medv
## [0.00632,0.082] 27.27451
## (0.082,0.257] 22.51485
## (0.257,3.68] 24.09703
## (3.68,89] 15.99314
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0031884105 0.0208440576 -0.041929772
## indus 0.0109161837 -0.0401399251 0.029229416
## chas -0.3279736950 -0.1285775950 0.336570696
## nox 2.7906127427 -6.4227737693 -11.173295057
## rm -0.1669832417 -0.0539775098 -0.269691841
## age 0.0111451124 -0.0174749975 -0.003479419
## dis 0.0007779010 -0.0721657174 0.063276438
## rad 0.3880996136 0.0975266139 -0.033356613
## tax 0.0001575583 0.0009460959 0.004446496
## ptratio 0.0413786927 0.0078736414 -0.131143031
## black -0.0012901167 0.0005826151 0.001713432
## lstat 0.0224876499 -0.0198067060 0.062101434
## medv 0.0174902892 -0.0368794675 -0.008922815
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9546 0.0351 0.0103
# Plotting the LDA model
plot(lda_fit)
# Saving the actual crime categories from the test set
actual_crime_categories <- test_set$crime_cat
# Using the LDA model to predict crime categories on the test set
predicted_crime_categories <- predict(lda_fit, newdata = test_set)$class
# Removing the categorical crime variable from the test set
test_set <- test_set[, -which(names(test_set) == "crime_cat")]
# Cross-tabulating the predicted vs actual crime categories
confusion_matrix <- table(Predicted = predicted_crime_categories, Actual = actual_crime_categories)
# Printing the confusion matrix
confusion_matrix
## Actual
## Predicted [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## [0.00632,0.082] 18 5 2 0
## (0.082,0.257] 6 14 7 0
## (0.257,3.68] 1 6 14 0
## (3.68,89] 0 0 2 25
Based on the confusion matrix above we can evaluate the performance of the classification model. Key observations from the results: the highest crime category, (3.68,89], is identified almost perfectly, while most misclassifications occur between the adjacent low and middle categories, whose predictor profiles overlap.
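The headline number can be pulled straight from the table. Re-entering the matrix above by hand, overall accuracy is the diagonal sum divided by the total:

```r
# Confusion matrix values copied from the output above
# (rows = predicted, columns = actual)
cm <- matrix(c(18,  5,  2,  0,
                6, 14,  7,  0,
                1,  6, 14,  0,
                0,  0,  2, 25),
             nrow = 4, byrow = TRUE)

accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.71: 71 of the 100 test observations are classified correctly
```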
Next, I’m going to perform k-means clustering on the standardized Boston dataset to identify clusters within the data.
# Reload the Boston dataset
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Calculating the distances between observations
distances <- dist(scaled_Boston)
# Installing and loading the factoextra package
if (!require(factoextra)) {
  install.packages("factoextra")
  library(factoextra)
}
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determining the optimal number of clusters using the elbow method
set.seed(123)
fviz_nbclust(scaled_Boston, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
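fviz_nbclust() wraps a simple computation: run kmeans for a range of k and plot the total within-cluster sum of squares. A base-R sketch of the same idea, using scaled mtcars as stand-in data so it runs without factoextra:

```r
scaled_data <- scale(mtcars)  # stand-in for scaled_Boston
set.seed(123)

# Total within-cluster sum of squares for k = 1..8;
# nstart = 10 restarts each kmeans run to avoid poor local optima
wss <- sapply(1:8, function(k) {
  kmeans(scaled_data, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The "elbow" is the k where the curve's decrease flattens out, which is why the vertical line above is drawn at 4.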
# Running k-means with the determined optimal number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)
# Creating a data frame for plotting
plot_data <- as.data.frame(scaled_Boston)
plot_data$cluster <- factor(kmeans_result$cluster)
# Visualizing the clusters using two variables from the dataset
ggplot(plot_data, aes(x = rm, y = lstat)) +
geom_point(aes(color = cluster)) +
labs(color = 'Cluster')
library(MASS)
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
Next, performing the K-means clustering
set.seed(123) # for reproducibility
# According to the assignment, a reasonable number of clusters is >2, here I'm choosing 4
kmeans_result <- kmeans(scaled_Boston, centers = 4)
Now, performing LDA using clusters as target classes
# Adding the cluster assignments as a factor to the Boston data
Boston$cluster <- factor(kmeans_result$cluster)
# Fitting LDA model using the clusters as target classes
library(MASS) # for LDA
lda_fit <- lda(cluster ~ ., data = Boston)
Finally, visualizing the results with Biplot
# Biplot for LDA with arrows for original variables
plot(lda_fit)
# Examining the model's coefficients
lda_fit$scaling
## LD1 LD2 LD3
## crim -0.033252529 0.093706888 -0.003875224
## zn -0.003852921 0.005772233 -0.047264096
## indus -0.107232590 -0.066745602 -0.041117990
## chas 0.068548580 -0.197356300 0.410601305
## nox -7.766232173 -5.409634182 -7.673642525
## rm 0.139330592 0.341290080 -0.794324360
## age 0.003903578 -0.004331691 0.015529903
## dis -0.001695810 -0.034881757 -0.176213368
## rad -0.063928129 -0.015823030 -0.007533964
## tax -0.004651926 -0.002905580 -0.003796952
## ptratio -0.159544007 -0.041917959 -0.038208490
## black 0.008358645 -0.020444688 -0.004007031
## lstat -0.028069814 0.013570317 -0.056673688
## medv 0.005932882 0.013682649 -0.086102624
The LDA scaling coefficients reveal that nox (nitric oxides concentration) is the predominant variable influencing the separation of clusters on the first discriminant (LD1), indicating its strong role in differentiating the clusters. rm (average number of rooms) emerges as the most significant for the second discriminant (LD2), suggesting its importance in further distinguishing between clusters. On the third discriminant (LD3), chas (proximity to Charles River) has the highest coefficient, highlighting its influence in cluster separation at this level. These variables—nox, rm, and chas—are therefore the most critical linear separators for the clusters, with their varying scales and units contributing to their discriminant weights.
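Since lda() coefficients reflect each variable's units (which is why nox, measured on a 0.38-0.87 scale, receives such a large weight), refitting on the standardized data makes the coefficients directly comparable. A sketch, assuming the same seed and cluster count as above:

```r
library(MASS)
data("Boston")
scaled_Boston <- scale(Boston)

set.seed(123)
clusters <- factor(kmeans(scaled_Boston, centers = 4)$cluster)

# LDA on standardized predictors: coefficients now share a common scale,
# so their magnitudes can be compared across variables
lda_scaled <- lda(scaled_Boston, grouping = clusters)
round(lda_scaled$scaling, 3)
```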
Now, the goal is to project the train data using the LDA model’s scaling matrix and then visualize the projections in 3D.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Here the full Boston data (with cluster labels) served as the training set,
# and 'lda_fit' is the LDA model fitted above
model_predictors <- Boston[, -which(names(Boston) == "cluster")]
# Checking the dimensions
dim(model_predictors)
## [1] 506 14
dim(lda_fit$scaling)
## [1] 14 3
# Matrix multiplication to get the projections
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
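Worth noting: this manual product is an uncentered projection. predict() subtracts the training means before applying the scaling matrix, so its LDx scores differ from matrix_product only by a constant shift per column, leaving the cluster geometry unchanged. A self-contained sketch checking this (rebuilding the pipeline so it runs standalone):

```r
library(MASS)
data("Boston")
scaled_Boston <- scale(Boston)
set.seed(123)
Boston$cluster <- factor(kmeans(scaled_Boston, centers = 4)$cluster)
lda_fit <- lda(cluster ~ ., data = Boston)

predictors    <- as.matrix(Boston[, names(Boston) != "cluster"])
manual_proj   <- predictors %*% lda_fit$scaling  # uncentered projection
centered_proj <- predict(lda_fit)$x              # centered by predict()

# Each column of the difference has (near-)zero spread: a constant shift
apply(manual_proj - centered_proj, 2, sd)
```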
Now, onto the 3D visualization…
# Installing and loading the plotly package
if (!require(plotly)) {
  install.packages("plotly")
  library(plotly)
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Creating a 3D scatter plot using the LDA projections
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = Boston$cluster) %>%
layout(scene = list(xaxis = list(title = 'LD1'),
yaxis = list(title = 'LD2'),
zaxis = list(title = 'LD3')))
Interpretation: